@SamuelMarks
Collaborator

@SamuelMarks SamuelMarks commented Oct 24, 2025

Description

Refactor: phase 3.1 of RESTRUCTURE.md, focusing on setup.sh and the related Python dependency files and Dockerfiles

Note: merge this before #2361, after which that PR shrinks by roughly half.

Tests

CI and manual:

$ bash ./dependencies/scripts/docker_build_dependency_image.sh DEVICE='tpu' MODE='nightly'
$ bash ./dependencies/scripts/docker_build_dependency_image.sh DEVICE='tpu' MODE='stable'
$ export MODEL_NAME='llama3_1_70b_8192_synthetic' \
         PROJECT="${GOOGLE_CLOUD_PROJECT?}" \
         ZONE="${GOOGLE_CLOUD_ZONE?}" \
         CLUSTER_NAME="${GOOGLE_CLOUD_CLUSTER_NAME?}" \
         OUTPUT_DIR="${GOOGLE_CLOUD_BUCKET?}" \
         BASE_OUTPUT_DIR="${GOOGLE_CLOUD_BUCKET?}"'/output/' \
         DATASET_PATH="${GOOGLE_CLOUD_BUCKET?}"'/' \
         WORKLOAD='job_name_goes_here'

$ python3 -m MaxText.train MaxText/configs/base.yml \
      run_name="${USER}"'002' \
      base_output_directory="${OUTPUT_DIR?}" \
      dataset_type='synthetic' \
      steps='10' \
      model_name='llama2-7b'

$ command='python3 -m MaxText.train MaxText/configs/base.yml
      base_output_directory='"${BASE_OUTPUT_DIR?}"'
      dataset_path='"${DATASET_PATH?}"'
      steps=100
      per_device_batch_size=1'

$ xpk workload create \
      --base-docker-image 'maxtext_base_image' \
      --zone "${ZONE?}" \
      --cluster "${CLUSTER_NAME?}" \
      --workload "${WORKLOAD?}" \
      --tpu-type='v6e-256' \
      --num-slices='1' \
      --command "${command?}"

$ python3 -m benchmarks.benchmark_runner xpk \
      --base_docker_image='maxtext_base_image' \
      --project="${PROJECT?}" \
      --zone="${ZONE?}" \
      --cluster_name="${CLUSTER_NAME?}" \
      --device_type='v6e-256' \
      --num_slices='1' \
      --base_output_directory="${OUTPUT_DIR?}" \
      --model_name="${MODEL_NAME?}"
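
The commands above rely on the `${VAR?}` form of parameter expansion, which makes the shell stop with an error when the named variable is unset instead of silently passing an empty value. A minimal sketch of that behaviour follows; `DEMO_ZONE` and `require_demo_zone` are hypothetical names used only for illustration and are not part of this PR.

```shell
#!/usr/bin/env bash
# Sketch: "${VAR?}" fails the expansion when VAR is unset.
# DEMO_ZONE and require_demo_zone are hypothetical, for illustration only.

require_demo_zone() {
  # Run in a subshell so a failed expansion exits only the subshell,
  # not the calling script; the error message is discarded.
  ( : "${DEMO_ZONE?}" ) 2>/dev/null
}

unset DEMO_ZONE
if require_demo_zone; then status_unset='ok'; else status_unset='missing'; fi

DEMO_ZONE='example-zone'
if require_demo_zone; then status_set='ok'; else status_set='missing'; fi

echo "unset: ${status_unset}, set: ${status_set}"
```

Note that `${VAR?}` only rejects unset variables; the `${VAR:?}` form also rejects variables that are set but empty.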

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@SamuelMarks SamuelMarks force-pushed the setup-deps-refactor branch 3 times, most recently from 03c95d4 to c8b8de8 Compare October 29, 2025 17:33
@kanglant
Collaborator

Why do we need to relocate the Dockerfiles and the requirement files?

@SamuelMarks
Collaborator Author

> Why do we need to relocate the Dockerfiles and the requirement files?

@kanglant This was agreed upon a long time ago in the RESTRUCTURE.md file. See internal design docs for more details.

@bvandermoon
Collaborator

> > Why do we need to relocate the Dockerfiles and the requirement files?
>
> @kanglant This was agreed upon a long time ago in the RESTRUCTURE.md file. See internal design docs for more details.

@SamuelMarks can we still have the base_requirements and generated_requirements structure within the new directory? These didn't exist at the time but it would help with organization to keep them

@SamuelMarks SamuelMarks force-pushed the setup-deps-refactor branch 6 times, most recently from 1cb7a0e to 0f56df0 Compare October 30, 2025 19:49
@SamuelMarks SamuelMarks force-pushed the setup-deps-refactor branch 2 times, most recently from 5504888 to 42be91b Compare October 31, 2025 17:14
# Copy assets separately
COPY src/MaxText/assets/ "${MAXTEXT_ASSETS_ROOT}"
COPY src/MaxText/test_assets/ "${MAXTEXT_TEST_ASSETS_ROOT}"
COPY generated_requirements .
Collaborator

Was this removed intentionally? I guess it just isn't used?

Collaborator

Also would like to know. Thanks!

Collaborator Author

That whole folder has been moved under dependencies so it's redundant and incorrect to retain.

Collaborator

@shralex shralex left a comment

Added some more comments / questions.

Given the extensive changes to the Dockerfiles and setup.sh, how are we testing these two?

@copybara-service copybara-service bot merged commit 83b3519 into AI-Hypercomputer:main Nov 3, 2025
65 checks passed
gpupuck added a commit to NVIDIA/JAX-Toolbox that referenced this pull request Nov 4, 2025